{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic neural bag-of-words text classifier with Thinc\n",
    "\n",
    "This notebook shows how to implement a simple neural text classification model in Thinc. Last tested with `thinc==8.0.0a9`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install thinc syntok \"ml_datasets>=0.2.0a0\" tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For simple and standalone tokenization, we'll use the [`syntok`](https://github.com/fnl/syntok) package and the following function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from syntok.tokenizer import Tokenizer\n",
    "\n",
    "def tokenize_texts(texts):\n",
    "    tok = Tokenizer()\n",
    "    return [[token.value for token in tok.tokenize(text)] for text in texts]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up the data\n",
    "\n",
    "The `load_data` function loads the DBPedia Ontology dataset, converts and tokenizes the data and generates a simple vocabulary mapping. Instead of `ml_datasets.dbpedia` you can also try `ml_datasets.imdb` for the IMDB review dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import ml_datasets\n",
    "import numpy\n",
    "\n",
    "def load_data():\n",
    "    train_data, dev_data = ml_datasets.dbpedia(train_limit=2000, dev_limit=2000)\n",
    "    train_texts, train_cats = zip(*train_data)\n",
    "    dev_texts, dev_cats = zip(*dev_data)\n",
    "    unique_cats = list(numpy.unique(numpy.concatenate((train_cats, dev_cats))))\n",
    "    nr_class = len(unique_cats)\n",
    "    print(f\"{len(train_data)} training / {len(dev_data)} dev\\n{nr_class} classes\")\n",
    "\n",
    "    train_y = numpy.zeros((len(train_cats), nr_class), dtype=\"f\")\n",
    "    for i, cat in enumerate(train_cats):\n",
    "        train_y[i][unique_cats.index(cat)] = 1\n",
    "    dev_y = numpy.zeros((len(dev_cats), nr_class), dtype=\"f\")\n",
    "    for i, cat in enumerate(dev_cats):\n",
    "        dev_y[i][unique_cats.index(cat)] = 1\n",
    "\n",
    "    train_tokenized = tokenize_texts(train_texts)\n",
    "    dev_tokenized = tokenize_texts(dev_texts)\n",
    "    # Generate simple vocab mapping, <unk> is 0\n",
    "    vocab = {}\n",
    "    count_id = 1\n",
    "    for text in train_tokenized:\n",
    "        for token in text:\n",
    "            if token not in vocab:\n",
    "                vocab[token] = count_id\n",
    "                count_id += 1\n",
    "    # Map texts using vocab\n",
    "    train_X = []\n",
    "    for text in train_tokenized:\n",
    "        train_X.append(numpy.array([vocab.get(t, 0) for t in text]))\n",
    "    dev_X = []\n",
    "    for text in dev_tokenized:\n",
    "        dev_X.append(numpy.array([vocab.get(t, 0) for t in text]))\n",
    "    return (train_X, train_y), (dev_X, dev_y), vocab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the model and config\n",
    "\n",
    "The model takes a list of 2-dimensional arrays (the tokenized texts mapped to vocab IDs) and outputs a 2d array. Because the embed layer's `nV` dimension (the number of entries in the lookup table) depends on the vocab and the training data, it's passed in as an argument and registered as a **reference**. This makes it easy to retrieve it later on by calling `model.get_ref(\"embed\")`, so we can set its `nV` dimension."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "import thinc\n",
    "from thinc.api import Model, chain, list2ragged, with_array, reduce_mean, Softmax\n",
    "from thinc.types import Array2d\n",
    "\n",
    "@thinc.registry.layers(\"EmbedPoolTextcat.v1\")\n",
    "def EmbedPoolTextcat(embed: Model[Array2d, Array2d]) -> Model[List[Array2d], Array2d]:\n",
    "    with Model.define_operators({\">>\": chain}):\n",
    "        model = with_array(embed) >> list2ragged() >> reduce_mean() >> Softmax()\n",
    "    model.set_ref(\"embed\", embed)\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The config defines the top-level model using the registered `EmbedPoolTextcat` function, and the `embed` argument, referencing the `Embed` layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CONFIG = \"\"\"\n",
    "[hyper_params]\n",
    "width = 64\n",
    "\n",
    "[model]\n",
    "@layers = \"EmbedPoolTextcat.v1\"\n",
    "\n",
    "[model.embed]\n",
    "@layers = \"Embed.v1\"\n",
    "nO = ${hyper_params:width}\n",
    "\n",
    "[optimizer]\n",
    "@optimizers = \"Adam.v1\"\n",
    "learn_rate = 0.001\n",
    "\n",
    "[training]\n",
    "batch_size = 8\n",
    "n_iter = 10\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training setup\n",
    "\n",
    "When the config is loaded, it's first parsed as a dictionary and all references to values from other sections, e.g. `${hyper_params:width}` are replaced. The result is a nested dictionary describing the objects defined in the config. `registry.resolve` then creates the objects and calls the functions **bottom-up**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from thinc.api import registry, Config\n",
    "\n",
    "C = registry.resolve(Config().from_str(CONFIG))\n",
    "C"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the data is loaded, we'll know the vocabulary size and can set the dimension on the embedding layer. `model.get_ref(\"embed\")` returns the layer defined as the ref `\"embed\"` and the `set_dim` method lets you set a value for a dimension. To fill in the other missing shapes, we can call `model.initialize` with some input and output data.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(train_X, train_y), (dev_X, dev_y), vocab = load_data()\n",
    "\n",
    "batch_size = C[\"training\"][\"batch_size\"]\n",
    "optimizer = C[\"optimizer\"]\n",
    "model = C[\"model\"]\n",
    "model.get_ref(\"embed\").set_dim(\"nV\", len(vocab) + 1)\n",
    "\n",
    "model.initialize(X=train_X, Y=train_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_model(model, dev_X, dev_Y, batch_size):\n",
    "    correct = 0.0\n",
    "    total = 0.0\n",
    "    for X, Y in model.ops.multibatch(batch_size, dev_X, dev_Y):\n",
    "        Yh = model.predict(X)\n",
    "        for j in range(len(Yh)):\n",
    "            correct += Yh[j].argmax(axis=0) == Y[j].argmax(axis=0)\n",
    "        total += len(Y)\n",
    "    return float(correct / total)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Training the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from thinc.api import fix_random_seed\n",
    "from tqdm.notebook import tqdm\n",
    "\n",
    "fix_random_seed(0)\n",
    "for n in range(C[\"training\"][\"n_iter\"]):\n",
    "    loss = 0.0\n",
    "    batches = model.ops.multibatch(batch_size, train_X, train_y, shuffle=True)\n",
    "    for X, Y in tqdm(batches, leave=False):\n",
    "        Yh, backprop = model.begin_update(X)\n",
    "        d_loss = []\n",
    "        for i in range(len(Yh)):\n",
    "            d_loss.append(Yh[i] - Y[i])\n",
    "            loss += ((Yh[i] - Y[i]) ** 2).sum()\n",
    "        backprop(numpy.array(d_loss))\n",
    "        model.finish_update(optimizer)\n",
    "    score = evaluate_model(model, dev_X, dev_y, batch_size)\n",
    "    print(f\"{n}\\t{loss:.2f}\\t{score:.3f}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
